Fully Automatic Expression-Invariant Face Correspondence



Augusto Salazar, Stefanie Wuhrer, Chang Shu, Flavio Prieto

February 8, 2012 



Abstract 

We consider the problem of computing accurate point-to-point correspondences 
among a set of human face scans with varying expressions. Our fully automatic 
approach does not require any manually placed markers on the scan. Instead, the 
approach learns the locations of a set of landmarks present in a database and uses 
this knowledge to automatically predict the locations of these landmarks on a newly 
available scan. The predicted landmarks are then used to compute point-to-point
correspondences between a template model and the newly available scan. To accurately
fit the expression of the template to the expression of the scan, we use a blendshape
model as the template. Our algorithm was tested on a database of human faces of different
ethnic groups with strongly varying expressions. Experimental results show that the
obtained point-to-point correspondence is both highly accurate and consistent for most
of the tested 3D face models.

1 Introduction

We consider the problem of computing point-to-point correspondences among a set of human 
face scans with varying expressions in a fully automatic way. This problem arises from 
building a statistical model that encodes face shape and expression simultaneously using a
database of human face scans. In order to build a statistical model, we rely on the correct
computation of dense point-to-point correspondences among the subjects of a database.
That is, the raw scans have to be parameterized in such a way that the same anatomical
parts correspond across the models [1]. Facial expression affects the geometry of the human
face and therefore is important for facial shape analysis. A statistical model of face shapes
and expressions can be used in applications such as face recognition, expression recognition,
or reconstructing accurate 3D models of faces from input images [2, 3, 4, 5, 6].

Computing accurate point-to-point correspondences for a set of face shapes in varying
expressions is a challenging task because the face shape varies across the database and each 
subject has its own way to perform facial expressions. The problem is further complicated 
by incomplete and noisy data in the scans. 

While many approaches have been proposed to compute point-to-point correspondences
[7], only a few of them have been applied to statistical model building and shape analysis
of human face shapes. Blanz and Vetter [2] built a statistical model, called a morphable model,
for a set of 3D face scans with varying expressions. The correspondence algorithm is based
on using optical flow on the texture information of the faces. This assumes that the faces are
approximately spatially aligned. Xi and Shu [8] built a statistical model based on principal
component analysis for a set of 3D face scans with neutral expressions. The correspondence
algorithm is based on fitting a template model to the scans using a non-rigid iterative closest
point algorithm. To start this algorithm, the faces need to be approximately aligned using
a set of manually placed marker positions. Both of these registration approaches fail for
misaligned models.

In this work, we develop a novel technique to compute correspondences between a set of 
facial scans with varying expressions that does not require the scans to be spatially aligned. 
Our correspondence computation procedure uses a template model P as prior knowledge on 
the geometry of the face shapes. Unlike Xi and Shu [8], we aim to find correspondences
for faces with varying expressions. Hence, it is not enough to have a template model that
captures the face shape of a generic model, but we also need to capture the expressions of a
generic model. To achieve this, we model P as a blendshape model as in Li et al. [9]. In a
blendshape model, expressions are modeled as a linear combination of a set of basic
expressions. Hence, blendshape models are both simple and effective to model facial expressions.

Our approach proceeds as follows. We first use a database of human face scans with manually
placed landmark positions to learn local properties and spatial relationships between
the landmarks using a Markov network. Given an input scan F without manually placed
landmarks, we predict the landmark positions on F by carrying out statistical inference
over the trained Markov network. Section 3 discusses this step. In order to perform statistical
inference, we need to restrict the search region for each landmark. This is detailed in
Section 4. The predicted landmarks are used to align P to F. In order to fit the expression
of P to the expression of F, the weights of the generic blendshape model are optimized as
discussed in Section 5.1. Finally, the shape of P is changed to fit the shape of F as outlined
in Section 5.2. Fig. 1 shows an overview of the method.



2 Related Work 

This section reviews literature in face shape analysis related to finding landmarks on face
models, computing correspondences between three-dimensional shapes, and using blendshape 
models for facial animation. 

Figure 1: Overview of the fully automatic expression-invariant face correspondence approach.

2.1 Finding Landmarks on Face Models

Traditionally, facial feature detection is done in 2D images, but recent developments in 3D
data acquisition have made it possible to overcome the problems attached to 3D technologies.
Existing registration methods have demonstrated that landmark-based methods provide more
accurate and consistent results. However, only a few approaches consider 3D landmark
detection while accounting for expression and pose variations [10].

Ben Azouz et al. [11] propose a method to find correspondences by automatically predicting
marker positions on 3D models of a human body. The method encodes the statistics of
a surface descriptor and geometric properties at the locations of manually placed landmarks 
in a Markov network. This method works only for models with slight variation of posture. 
Mehryar et al. [10] introduce an algorithm to automatically detect eyes, nose, and mouth 
on 3D faces. The algorithm correctly detects the landmarks in the presence of pose, facial 
expression and occlusion variations. This method is useful as initial alignment but not for an 
accurate registration. Creusot et al. [12] present a method to localize a set of 13 facial land- 
mark points under large pose variation or when occlusion is present. Their method learns 
the properties of a set of descriptors computed at the landmark locations and encodes both 
local information and spatial relationships into a graph. The method works well for neutral 
pose. However, in the presence of expression variation, the accuracy decreases considerably. 

As our aim is to obtain accurate point-to-point correspondences, we derived a landmark
prediction method based on the approach of Ben Azouz et al. [11]. The surface descriptor we
use captures the local geometry properly [13] and, by combining it with a canonical
representation [14], our approach is able to detect landmarks in the presence of facial
expressions. We select a machine learning-based approach to avoid classic assumptions about
the initial alignment of the scans and about local descriptor extrema being stable feature points.
The advantage is that learning-based approaches can easily be extended to other contexts.

2.2 Correspondence Computation

Several methods have been proposed to solve the problem of establishing a meaningful cor- 
respondence between shapes. Here, we focus on computing correspondences between human 
face shapes. Methods that do not assume templates usually have the problem that some 
points are not registered accurately. To remedy this, we assume a template model. In the 
following, we only review approaches that use template models (for details about methods 
for correspondence computation see the survey of van Kaick et al. [7]).

Passalis et al. [15] proposed a 3D face recognition method that uses facial symmetry to
handle pose variation and missing data. A template is fitted to the shape of the input model
as follows: an Annotated Face Model (AFM) [16] is iteratively deformed towards the input
using automatically predicted landmarks and an algorithm based on Simulated Annealing. 
When dealing with facial expressions, the performance of the recognition system decreases. 
This is due to an incorrect registration of the mouth region. The authors do not show 
extensive evaluations of this fully-automatic registration method as this is not the main part 
of their work. 

Statistical learning-based approaches have been effectively used to model facial variations 
oriented to both the synthesis and recognition of faces. Blanz and Vetter [2] developed a 
3D morphable model (3DMM) for the synthesis of 3D faces from photographs. As the 
registration is specific to the scanning setup, rigid alignment of the scans is assumed. Lu 
and Jain [17] present an approach to perform face recognition using 3D face scans. The 
approach builds a 3DMM for each subject in the database. When a test image becomes 
available, the approach matches the scan to a specific individual using the learned 3DMM. 
Unlike our method, their training data is parameterized using manually placed landmarks 
and the test scans are parameterized using individual-specific deformation models. Basso et 
al. [18] extend the method of Blanz and Vetter [2] to register 3D scans of faces with arbitrary
identity and expression. The rigid alignment of the scans is also assumed for registration.
To avoid the use of texture information, Amberg et al. [19] present a method to fit a 3DMM
to 3D face scans using only shape information. They demonstrate the performance of the 
method in the presence of expression variation, occlusion and missing data, but do not 
conduct extensive evaluations of the registration. 

Registration methods based on iteratively deforming a template to the data are an
alternative to statistical learning-based approaches. Allen et al. [20] present an approach to
parameterize a set of 3D scans of human body shapes in similar posture. To fit the
template to each scan, the method proceeds by using a non-rigid iterative closest point (ICP)
framework coupled with a set of manually placed marker positions. Xi and Shu [8] extend
the method of Allen et al. [20] to deform a template model to a head scan. The shape
fitting is carried out as in Allen et al. [20] but uses radial basis functions to speed up the
deformation process. Unlike our method, this only allows for neutral expressions and uses
manually placed markers to align the template to a head scan. Wuhrer et al. [21] propose
a method to deform a template model to a human body scan in arbitrary posture. The
method works in two stages: posture and shape fitting. Posture fitting relies on the location
of different landmarks, which are predicted in a fully automatic way using a statistical model
of landmark positions learned from a population. Our method can be viewed as an extension
of this approach, but instead of fitting the posture, we fit the expression using blendshapes
(see Section 2.3).

Methods that compute a correspondence between two surfaces by embedding the intrinsic 
geometry of one surface into the other one by using Generalized Multi-Dimensional Scaling 
(GMDS) [22] are another alternative to deal with variations due to facial expressions [23].
The performance of these methods has been demonstrated for face recognition. As GMDS 
methods do not take care that close-by points on one surface map to close-by points on the 
other, the results are often spatially inconsistent. This prevents such methods from being 
used for shape analysis. 



2.3 Use of Blendshape Models 

Modeling expressions using blendshape models is an alternative to approaches based on 
statistical models where a comprehensive database annotation process has to be carried 
out to extract variational information. In a blendshape model, movements of the different
facial regions are assumed to be independent. Any expression is then modeled as a linear
combination of the differences between a set of basic expressions, called blendshapes, and
a neutral expression. That is, to produce an expression, the displacements causing the 
movement are linearly combined. Using a representative set of blendshapes, this simple 
model is effective to model facial expressions. 

Li et al. [9] propose a method to transfer the expression of a subject to an animated
character. Their framework automatically creates optimal blendshapes from a set of example
poses of a digital face model. To fit the expression of the subject to the character, a
blendshape optimization is carried out in gradient space, where an optimal blending weight
is estimated for every template vertex and expression. Weise et al. [24] present a framework
for real-time 3D facial animation. The method tracks the rigid and non-rigid motion of the
user's face accurately. They incorporate the expression transfer approach of Li et al. [9] in
order to find much of the variation from the example expressions. The registration stage 
requires offline training where a generic template is fitted to the face of a specific subject. 
To obtain the results, manual marking of features has to be carried out. 

Because of the advantages of modeling expression using linear blendshapes, we use it to 
aid the shape matching. We only optimize a blending weight per expression. This reduces 
the dimension of the optimization space drastically. Since our database of blendshapes is 
small, the expression fitting stage of our algorithm is efficient and helps to improve the results 
significantly. 

3 Landmark Prediction



This section outlines how to predict a set of landmark positions on a face scan. To establish 
the correspondences across the whole database, we fit a template to each model. The fitting 
process begins with the extraction of the locations of eight landmarks shown as red spheres 
in Fig. 2. The locations of the landmarks were selected based on the fact that in the
presence of facial expressions, the corners of the eyes, and the base and tip of the nose do
not move drastically. Each landmark is located automatically on the face surface by means
of a Markov network following the procedure proposed by Ben Azouz et al. [11]. The network
learns the statistics of a property of the surface around each landmark and the structure of
the connections shown in Fig. 2.




Figure 2: Face model with landmarks. Locations and landmark graph structure. 



3.1 Learning 

Two important aspects have to be defined for the training of the Markov network. First,
each landmark l_i (i = 1, 2, ..., L), represented by a network node, is described using a node
potential. We use a surface descriptor called Finger Print (FP) [13], which is a measure
related to the area of a geodesic circle centered at the point to be characterized. The
descriptor at a point p_k (k = 1, 2, ..., N, where N is the number of vertices in the model) is
obtained by computing the distortion of the geodesic disks with respect to their corresponding
Euclidean disks. The final surface descriptor is a vector of distortions obtained by varying
the radius of the geodesic disk (see Fig. 3). We use FP as node potential
because it is isometry-invariant. Hence, in scenarios where the surface undergoes isometric
deformations, FP has been effective to encode the surface information of an object.
FP has been used to predict landmarks on human models in varying poses [25].

Second, a link between landmarks, represented by a network edge, is described using an
edge potential. Although we selected the locations of the landmarks based on the observation
that nose and eye regions do not change much in the presence of expressions, some
distortions along the edges of the Markov network may occur. To minimize the effects of
the face movements, we compute the canonical form [14] of each model and define the edge
potential as the relative position of landmark l_i with respect to landmark l_j in the canonical
form space. We compute the canonical form as the embedding of the intrinsic geometry of the
face surface to ℝ^3. To compute this embedding, we perform least-squares multi-dimensional
scaling [26] with geodesic distances between vertices as dissimilarities, and the geodesic
distances are computed using fast marching [14]. We choose these standard techniques as they
are efficient. These potentials ensure that the model is isometry-invariant.

Figure 3: Circles used to compute the Finger Print descriptor. Red and green circles
correspond to the geodesic and Euclidean circles, respectively.

The Markov network training process learns the distributions of both node and edge 
potentials. We assume Gaussian distributions for both the node and edge descriptors in this 
paper, and we learn the distributions using maximum likelihood estimation. We choose this 
distribution based on experimental observations. 
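
The maximum likelihood step above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each potential is a short numeric vector and that the Gaussian has a diagonal covariance; the function names `fit_gaussian` and `log_likelihood` and the sample values are hypothetical and do not come from the paper.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and per-dimension variance.

    Assumption: a diagonal covariance is used to keep the sketch short.
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    # The ML estimate of the variance divides by n, not n - 1.
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n for d in range(dim)]
    return mean, var

def log_likelihood(x, mean, var, eps=1e-9):
    """Log-density of x under the fitted diagonal Gaussian."""
    total = 0.0
    for xd, m, v in zip(x, mean, var):
        v = max(v, eps)  # guard against a zero variance
        total += -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
    return total

# Hypothetical descriptor values at one landmark across four training scans.
training = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
mean, var = fit_gaussian(training)
near = log_likelihood([1.0, 2.0], mean, var)
far = log_likelihood([3.0, 0.0], mean, var)
```

A candidate whose descriptor is close to the learned mean receives a higher log-likelihood than a distant one, which is what the inference step can use as a node score.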

3.2 Prediction 

The estimation of the location of landmarks on a test model is carried out by using probabilistic
inference over the Markov network. In practice, we perform inference using the loopy
belief propagation algorithm [27]. This algorithm requires a set of possible labels for each
node. In our case, this means we need to provide a number of candidate locations for each 
landmark. 
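
To make the role of the candidate locations concrete, the following is a small sketch of loopy belief propagation in its min-sum form, where costs can be read as negative log-probabilities. It is an illustrative stand-in, not the implementation used in the paper: the function `loopy_bp`, the one-dimensional candidate positions, and the expected offsets are all hypothetical.

```python
def loopy_bp(node_cost, edges, edge_cost, iterations=10):
    """Min-sum loopy belief propagation.

    node_cost[i][a]: cost of candidate a at node i.
    edges: list of (i, j) node pairs.
    edge_cost(i, a, j, b): cost of choosing candidate a at i and b at j.
    Returns the index of the selected candidate for every node.
    """
    n = len(node_cost)
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # msg[(i, j)][b] is the message from node i to node j about candidate b.
    msg = {}
    for i, j in edges:
        msg[(i, j)] = [0.0] * len(node_cost[j])
        msg[(j, i)] = [0.0] * len(node_cost[i])
    for _ in range(iterations):
        new_msg = {}
        for (i, j) in msg:
            out = []
            for b in range(len(node_cost[j])):
                out.append(min(
                    node_cost[i][a] + edge_cost(i, a, j, b)
                    + sum(msg[(k, i)][a] for k in neighbors[i] if k != j)
                    for a in range(len(node_cost[i]))))
            low = min(out)  # normalize so that messages stay bounded
            new_msg[(i, j)] = [v - low for v in out]
        msg = new_msg
    labels = []
    for i in range(n):
        belief = [node_cost[i][a] + sum(msg[(k, i)][a] for k in neighbors[i])
                  for a in range(len(node_cost[i]))]
        labels.append(belief.index(min(belief)))
    return labels

# Hypothetical 1D candidate positions for three landmarks in a triangle graph.
positions = [[0.0, 5.0], [1.0, 9.0], [2.0, 7.0]]
offsets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}  # learned relative positions

def edge_cost(i, a, j, b):
    expected = offsets[(i, j)] if (i, j) in offsets else -offsets[(j, i)]
    return (positions[j][b] - positions[i][a] - expected) ** 2

node_cost = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = loopy_bp(node_cost, list(offsets), edge_cost)
```

In this toy setup the first node slightly prefers its second candidate on its own, but the pairwise terms pull all three nodes to the mutually consistent first candidates.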

Wuhrer et al. [21] use the canonical forms to learn the average locations of the landmarks,
but because of the flipping-invariant property of the canonical forms, it is necessary to
compute eight different alignments and select the one that leads to the minimum distance
between the scan and the deformed template. In this work, we design a method to restrict
the search space based on a rough template alignment. In this way, only one fitting process
has to be computed, reducing the computing cost by a factor of 8.

4 Restricting the Search Region

There are two reasons to reduce the search space for the landmarks: to increase the efficiency
of the landmark prediction and to eliminate the ambiguity caused by the facial symmetry.
We treat the problem of restricting the search region for the landmarks as a 3D face pose
estimation problem. In our case, the estimated pose does not have to be very accurate, since
the Markov network refines the position of the landmarks, but it has to be accurate enough
to identify the left and right sides of the face. The proposed face pose estimation method
finds four landmarks located on the nose region and extracts the information of the face
symmetry planes by using a template of the landmark graph. Once the nose landmarks are
labeled, the final position of the entire set of landmarks is obtained by transforming the
template to the coordinate system of the test model. Fig. 4 shows the main steps of the
proposed search space restriction method.

Before explaining the rough template alignment procedure, we introduce a method to 
classify a vertex of a 3D model into a specific class. In our case, the classes correspond 
to the nodes of the Markov network and the 3D model corresponds to a 3D face model. 
The decision rules are derived from a clustering procedure over the Principal Components 
Analysis (PCA) projections of a surface feature and a pre-selection method based on the
surface primitives. 

As the value of the FP descriptor at each landmark l_i was computed during the Markov
network training process, we can model the distributions of the surface descriptors and
use them to classify a vertex v_k on the face surface into a class i (each landmark
corresponds to a class). PCA is a useful tool to compress a high-dimensional space into a linear
low-dimensional space. When the space corresponds to a multidimensional feature space,
elements of the same class may form clusters in the PCA space, depending on the
distinctiveness of the features. In our case, the FP descriptor can be viewed
as an S-dimensional vector and PCA is used to reduce the dimensionality to D. Fig. 5 shows
the results of applying PCA to the data with neutral expression (for information about the
database, see Section 6.1).

In PCA space, samples of the same class tend to form groups, which are slightly separated.
As the eye corners and nose base landmarks are symmetric, there are six groups and some
groups overlap. Since the data forms clusters, it is possible to define rules in order to assign a
new sample to a specific class. We define a new cluster, denoted as M-cluster, by removing
the samples that are farther than M (M ∈ ℝ⁺) times the standard deviation from the
cluster medoid. Medoids are representative objects of a cluster whose average dissimilarity
to all the objects in the cluster is minimal [28]. For instance, Fig. 5 shows the M-clusters
formed by setting M = 1. While some M-clusters are partially overlapping, we can see that
the separation between classes improved over the initial clustering.
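
The trimming step that produces an M-cluster can be sketched as follows. The text does not spell out how the standard deviation is measured, so this sketch assumes it is the root-mean-square distance of the samples to the medoid; the helper names `medoid` and `m_cluster` and the sample points are hypothetical.

```python
import math

def medoid(points):
    """Sample with the smallest total (hence average) distance to all samples."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def m_cluster(points, m=1.0):
    """Keep the samples within m standard deviations of the medoid.

    Assumption: the standard deviation is taken as the RMS distance of the
    samples to the medoid.
    """
    center = medoid(points)
    dists = [math.dist(p, center) for p in points]
    sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
    return [p for p, d in zip(points, dists) if d <= m * sigma]

# Hypothetical PCA projections of one landmark class, with one stray sample.
samples = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
           (-0.1, 0.0, 0.0), (3.0, 3.0, 3.0)]
kept = m_cluster(samples, m=1.0)
```

With M = 1 the stray sample is dropped while the four tightly grouped samples are kept.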

We derive a rule E_i for a class i based on the clustering procedure. The rule E_i is defined
as the minimum volume enclosing ellipsoid of M-cluster_i (see Fig. 5). E_i is obtained from
the representation of the ellipsoid in center form as (p_k − C)^T A (p_k − C) ≤ 1, where C
corresponds to the center of the ellipsoid and A is the 3 × 3 matrix of the ellipsoid equation.
When a new point p_k becomes available, each E_i is evaluated in order to see if the point
satisfies the inequality. As some M-clusters are overlapping, it is possible that two or more
labels are assigned to the same p_k. Similarly, it is possible that p_k is not assigned to any
class because the point lies in a region that is not of interest. Fig. 6 shows an example of
the vertex classification results obtained using the proposed method.
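
Evaluating the ellipsoid rules for a projected point can be sketched as below. The sketch only covers the membership test in center form; computing the minimum volume enclosing ellipsoid itself is not shown. The function names, class labels, and ellipsoid parameters are hypothetical.

```python
def inside_ellipsoid(p, center, A):
    """True if (p - C)^T A (p - C) <= 1 for the ellipsoid given by C and A."""
    d = [pi - ci for pi, ci in zip(p, center)]
    Ad = [sum(A[r][c] * d[c] for c in range(len(d))) for r in range(len(d))]
    return sum(di * adi for di, adi in zip(d, Ad)) <= 1.0

def classify(p, rules):
    """Return every class whose ellipsoid contains p (none, one, or several)."""
    return [label for label, (center, A) in rules.items()
            if inside_ellipsoid(p, center, A)]

# Two hypothetical, overlapping unit-sphere rules in a 3D PCA space.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rules = {
    "nose_tip": ((0.0, 0.0, 0.0), identity),
    "nose_base": ((1.5, 0.0, 0.0), identity),
}
both = classify((0.75, 0.0, 0.0), rules)
single = classify((-0.5, 0.0, 0.0), rules)
none = classify((5.0, 5.0, 5.0), rules)
```

The three calls reproduce the cases described above: a point in the overlap receives two labels, a point inside one ellipsoid receives one, and a point far from all ellipsoids receives none.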

Figure 4: Framework of the proposed initial alignment method.

Figure 5: PCA-based clustering. Left to right: landmarks on a face model; initial clusters
formed with all the samples; final clusters after removing the samples beyond one standard
deviation from the cluster medoid; minimum volume enclosing ellipsoids.

Figure 6: Example of vertex labeling result. (A) Notice how the points on the nose tip
region are correctly labeled. (B) Some vertices are assigned to two classes because of the
left-right symmetry of the features. (C) Points located far from the region of
interest are discarded.

It is not efficient to compute the descriptor value and its projection to PCA space for all
the vertices of the mesh. To reduce the search space, we compute samples on the surface using
a curvature-based descriptor. More precisely, we use as samples all surface umbilics,
which are the points on the surface where the principal curvatures are identical (that is,
κ_1 = κ_2). We choose this sampling approach because it can be observed experimentally
that most landmark positions are located close to an umbilic, as shown in Fig. 7. For each
umbilic, the FP descriptor is computed, projected into the PCA space, and labeled following
the procedure described above.
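
As a rough illustration of the umbilic-based sampling, the sketch below selects vertices whose two principal curvatures agree. It assumes the principal curvatures have already been estimated per vertex, and it uses a tolerance because exact equality is unlikely on a discrete mesh; the tolerance value and the function name `umbilic_candidates` are assumptions, not taken from the paper.

```python
def umbilic_candidates(curvatures, tol=1e-3):
    """Indices of vertices whose principal curvatures agree within tol.

    curvatures: list of (k1, k2) pairs, one per vertex (assumed precomputed).
    """
    return [i for i, (k1, k2) in enumerate(curvatures) if abs(k1 - k2) <= tol]

# Hypothetical per-vertex principal curvatures.
curvatures = [(0.5, 0.5), (0.8, 0.2), (0.3001, 0.3), (1.0, -1.0)]
samples = umbilic_candidates(curvatures)
```

Only the selected vertices would then need a descriptor evaluation and a PCA projection.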




Figure 7: Umbilics of different 3D facial models of the same subject performing different
expressions (fear, happiness, and sadness). Notice how the umbilics are distributed all over
the surface, and in most of the cases umbilics are present at the locations of salient facial
features.

Once the vertices have been labeled, the next step is to roughly align a template of the
upper part of the face, which has the same structure as the landmark graph (see Fig. 2), to
the scan. The locations of the template vertices are used to define the search space
regions on which statistical inference is performed. We then perform statistical inference on
these search space regions using belief propagation to predict the landmarks, as discussed
in Section 3.2.



5 Registration 

In this section, we describe how a template is fitted to a 3D scan of the face. The input 
scan corresponds to a face of a subject performing a facial expression. Fitting a template 
to this scan is challenging because the facial geometry has large variations due to different 
face shapes and facial muscle movements. We propose a registration method, where the 
expression and the shape are fitted separately in order to handle the complexity of the 
problem. Fig. 8 shows an overview of the proposed method.

Figure 8: Registration procedure. First, the template and the scan are aligned using the
predicted landmarks. Second, the expression is fitted using a blendshape model. Finally, an 
energy-based surface fitting method is used to fit the shape. At the end, the overlap between 
the scan and the template is maximized and a point-to-point correspondence for the face 
shapes in different expressions is obtained. 



5.1 Expression Fitting 

We address the facial expression fitting problem as a facial rigging problem. In facial rigging, 
a facial expression is produced by changing a set of parameters associated with the different 
regions of the face modeled using blendshapes. Conceptually, to generate a facial shape from 
a 3D rest pose face template, we just move a set of vertices to a new location, e.g., lift an 
eyebrow or open the mouth (see Fig. 9). In this sense, and similar to the approach proposed
by Li et al. [9], we model a facial expression as a linear combination of facial blendshapes
(denoted by A_i), which are expressed as vectors of displacements from the rest pose (denoted
by A_0). Unlike Li et al. [9], where the blendshape model requires the optimization of a weight
vector per pose and vertex of the template, we propose a simplified blendshape model where
only one weight per pose has to be optimized. The aim of our blendshape model is to capture
the pose variations more than the shape variation. Hence, an expression can be generated
as

$$P = A_0 + \sum_{i=1}^{3} \alpha_i A_i \qquad (1)$$

where A_0 corresponds to the rest pose, A_i (i > 0) correspond to the displacements, and
α_i (0 ≤ α_i ≤ 1) are the blending weights of the pose P. Fig. 9 shows the expression
corresponding to each blendshape A_i. This formulation transforms the facial expression fitting
problem into an optimization problem, where the value of each α_i has to be estimated.
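
The linear blendshape combination can be written in a few lines. This is a minimal sketch that stores vertices as (x, y, z) tuples and uses one weight per blendshape; the function name `blend` and the toy geometry are hypothetical.

```python
def blend(rest_pose, displacements, weights):
    """Evaluate P = A_0 + sum_i alpha_i * A_i.

    rest_pose: list of (x, y, z) vertices (A_0).
    displacements: one list of (dx, dy, dz) per blendshape (A_i).
    weights: one blending weight alpha_i in [0, 1] per blendshape.
    """
    if len(displacements) != len(weights):
        raise ValueError("one weight per blendshape is required")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("blending weights must lie in [0, 1]")
    result = []
    for v, vertex in enumerate(rest_pose):
        result.append(tuple(
            vertex[c] + sum(w * disp[v][c]
                            for w, disp in zip(weights, displacements))
            for c in range(3)))
    return result

# Toy template with two vertices and three blendshapes.
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shapes = [
    [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
pose = blend(rest, shapes, [0.5, 0.25, 0.0])
```

Setting all weights to 0 returns the rest pose, which matches the starting point of the alignment step.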

To solve the fitting problem, the expression template P is first aligned to a scan F with all
α_i set to 0. Both P and F contain a set of landmarks; those of P are denoted by l_i, and those
of F were predicted using the method described in Sections 3 and 4. The alignment
is carried out by finding a 3 × 4 transformation matrix T that minimizes an energy measuring
the distances between corresponding landmarks with respect to the 12 parameters in T.

Figure 9: Template: rest pose and a set of generated blendshapes A_i.

Once P and F are aligned, we find the ai that best match the expression of F. To 
achieve this, we divide P into three regions: upper face, chin and mouth (as shown in Fig. 
10). The division is motivated by the fact that the chin and lip regions vary drastically from 
one expression to another (mostly in terms of displacements). Thus it is desirable to inspect 
the quality of the fitting in each of these regions separately. Face regions like eyebrows and 
cheeks also change their shapes to produce the expressions but we expect that these changes 
can be captured during the shape fitting step. 




Figure 10: Regions used in the expression fitting procedure.

To fit the expression, we define the energy

$$E_{expression} = \beta E_{NN} + \gamma E_{chin} + \eta E_{mouth}, \qquad (2)$$

where

$$E_{NN} = \sum_{r} (p_r - NN(p_r))^2,$$
$$E_{chin} = \sum_{a} (p_a - NN(p_a))^2,$$
$$E_{mouth} = \sum_{b} (p_b - NN(p_b))^2,$$

and β, γ and η are weights, r are the vertices of P, and a and b are the vertices of the
chin and mouth regions in P, respectively. Here, NN(p_i) indicates the nearest neighbor of
a specific vertex p_i. To make the method more robust to both the presence of outliers and
misoriented surfaces, we only consider the nearest neighbor in E_NN if the angle between
the outer normal vectors of the vertex p_i and its nearest neighbor is at most a threshold φ.
To force the fit to be exact, the nearest neighbor term for E_chin and E_mouth is only valid if
the angle is at most φ/2. The expression is fitted by minimizing Eq. 2 with respect to the
blending weights α_i. In our experiments we set φ to 80 degrees.
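
A nearest-neighbor energy with the normal-angle validity test can be sketched as follows. The sketch uses a brute-force nearest-neighbor search for clarity, whereas a practical implementation would likely use a spatial data structure; the function names and the toy data are hypothetical.

```python
import math

def angle_deg(n1, n2):
    """Angle in degrees between two (not necessarily unit) normal vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))

def nn_energy(vertices, normals, scan_points, scan_normals, max_angle=80.0):
    """Sum of squared distances to valid nearest neighbors.

    A nearest neighbor is valid only if the angle between the two outer
    normals is at most max_angle. Returns (energy, number of valid neighbors).
    """
    energy, valid = 0.0, 0
    for p, n in zip(vertices, normals):
        j = min(range(len(scan_points)),
                key=lambda k: math.dist(p, scan_points[k]))
        if angle_deg(n, scan_normals[j]) <= max_angle:
            energy += math.dist(p, scan_points[j]) ** 2
            valid += 1
    return energy, valid

# Toy data: the second scan point has a flipped normal and is rejected.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
scan = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.2)]
scan_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
energy, valid = nn_energy(template, template_normals, scan, scan_normals)
```

For the chin and mouth terms, the same function could be called on the corresponding vertex subsets with `max_angle=40.0`, that is, half of the 80 degree threshold. The returned count of valid neighbors is the kind of quantity the weighting scheme relies on.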

The minimization of E_expression is carried out in two stages. In the first stage, we inspect
whether some movement occurs in the chin; most of the time the displacements are horizontal.
Once we know the position of the chin, to refine the match with the expression of the input
model, we need to inspect the positions of the lips; in this stage both vertical and horizontal
displacements, and shape changes of the lips are matched. Based on this, the expression
fitting procedure proceeds as follows. First, the weight η is set to 0, so the minimization is
guided only by E_NN and E_chin. In this step, β is set to 1 and γ is defined as 1 − (V_chin / |a|),
where V_chin is the number of valid nearest neighbors in the chin region. The second step begins
when V_chin > 0.8 |a|, which means that the overlap between the chin region of P and the
model F has reached a good level. At this point, η is set to 1 − (V_mouth / |b|), where V_mouth is
the number of valid nearest neighbors in the mouth region. The minimization process ends
when V_mouth > 0.6 |b|. This weight variation scheme ensures that the chin and mouth regions
of P match the expression of F. The threshold values for V_chin and V_mouth were chosen based
on experimental observations.
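
The two-stage weighting scheme can be summarized in a short sketch. It is stateless and assumes that the chin weight keeps its definition during the second stage, which the text does not state explicitly; the function names are hypothetical.

```python
def expression_weights(v_chin, n_chin, v_mouth, n_mouth, chin_ratio=0.8):
    """Return the weights (beta, gamma, eta) for the current state of the fit.

    v_chin, v_mouth: number of valid nearest neighbors in each region.
    n_chin, n_mouth: number of template vertices in each region.
    Assumption: gamma keeps the form 1 - v_chin / n_chin in both stages.
    """
    beta = 1.0
    gamma = 1.0 - v_chin / n_chin
    if v_chin > chin_ratio * n_chin:
        eta = 1.0 - v_mouth / n_mouth  # second stage: the mouth term is active
    else:
        eta = 0.0  # first stage: only the global and chin terms guide the fit
    return beta, gamma, eta

def finished(v_mouth, n_mouth, mouth_ratio=0.6):
    """Stopping test: enough mouth vertices have a valid nearest neighbor."""
    return v_mouth > mouth_ratio * n_mouth

stage_one = expression_weights(50, 100, 10, 100)
stage_two = expression_weights(90, 100, 30, 100)
```

As the overlap in a region improves, its weight decreases, so the optimization shifts its attention to the region that still fits poorly.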

5.2 Shape Fitting 

As most of the changes in terms of movement, especially in the chin and mouth regions, were 
captured in the expression fitting stage, the next step consists of adapting the shape of the 
deformed template P to the shape of the scan F. In addition, the changes (displacements) 
resulting from muscle movement in the eyebrows and forehead are also captured. 

The shape fitting is, again, treated as an optimization problem, similar to the method
proposed by Allen et al. [20] and extended by Li et al. [30]. The goal is to find a 3 × 4
transformation matrix T_i for each vertex p_i of P such that the vertex is moved to the new
location T_i p_i to fit the shape of F. The transformed version of P is denoted P′. The transformation
matrices are obtained by minimizing an energy function, which is a weighted sum of four
energy terms.
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The first term corresponds to the nearest neighbor term 

$$E_{NN} = \sum_{i=1}^{r} \left( T_i p_i - NN(T_i p_i) \right)^2,$$

where $NN(T_i p_i)$ indicates the nearest neighbor of a transformed vertex $\tilde{p}_i = T_i p_i$. This
term is only considered if the angle between the outer normal vectors of the vertex and its nearest
neighbor is at most 80 degrees. The nearest neighbor energy term ensures that the template
is deformed to resemble the input scan.
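
A minimal sketch of such a nearest neighbor term with the normal compatibility test is given below. It is not the authors' implementation: the function name is hypothetical, the nearest neighbor search is brute force, the input points are assumed to be already transformed, and all normals are assumed to have unit length.

```python
import math

def nearest_neighbor_energy(points, normals, scan_points, scan_normals,
                            max_angle_deg=80.0):
    """Sum of squared distances from each point to its nearest scan point.

    A pair contributes only if the angle between the two unit normals
    is at most max_angle_deg.
    """
    cos_limit = math.cos(math.radians(max_angle_deg))
    energy = 0.0
    for p, n in zip(points, normals):
        # Brute-force nearest neighbor search (a spatial index would be used in practice).
        j = min(range(len(scan_points)), key=lambda k: math.dist(p, scan_points[k]))
        q, m = scan_points[j], scan_normals[j]
        # For unit normals the dot product equals the cosine of the angle.
        cos_angle = sum(a * b for a, b in zip(n, m))
        if cos_angle >= cos_limit:
            energy += math.dist(p, q) ** 2
    return energy
```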

The second energy term corresponds to the smoothness energy 

Esmooth = ^ ^ I 1 - I (Ti - Tj)^, 

where $R(p_i)$ is a set of indices corresponding to points $p_j$ within Euclidean and geodesic
distance $sg$ of $p_i$, with $g$ being the resolution of the mesh and $s$ being a constant. This energy
term encourages close-by points to have similar deformations. Note that this term does not
encourage a smoothing of the geometry of the mesh. We set $s = 3$ in our implementation.

The third energy term is a regularization term that is conceptually similar to the smoothness
energy. This energy term encourages smooth transformations between neighboring vertices
of the mesh. We call this energy the regularization energy $E_{reg}$ and define it as

$$E_{reg} = \sum_{(i,j) \in \mathcal{E}} (T_i - T_j)^2,$$

where $\mathcal{E}$ is the set of edges of P. This term prevents adjacent parts of P from being
mapped to disparate parts of F, and also encourages similarly-shaped features to be mapped
to each other [20].
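
As an illustration, the regularization energy over mesh edges can be sketched as follows. The helper names are hypothetical, each transformation is stored as a 3 × 4 nested list, and the squared difference of two matrices is assumed to mean the squared Frobenius norm, as in Allen et al. [20].

```python
def frobenius_sq(a, b):
    """Squared Frobenius norm of the difference of two equally sized matrices."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def regularization_energy(transforms, edges):
    """Sum of squared transformation differences over the mesh edges (i, j)."""
    return sum(frobenius_sq(transforms[i], transforms[j]) for i, j in edges)
```

Two vertices joined by an edge contribute nothing when they share the same transformation, which is the behavior the term rewards.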

The final energy term encourages the transformation matrices to be rigid. The rigid
energy $E_{rigid}$ measures the deviation of $a_1$, $a_2$, $a_3$, the first three column vectors
of $T_i$, from orthogonality and unit length.
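
One way to measure the deviation from orthogonality and unit length is sketched below. The specific form is an assumption in the style of Li et al. [30], not a formula taken from this text: it sums the squared pairwise dot products of the first three columns and the squared deviations of their squared lengths from 1. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rigid_deviation(t):
    """Deviation from rigidity of one 3 x 4 matrix given as three rows of four numbers."""
    # First three columns of the matrix; the fourth column is the translation.
    a1, a2, a3 = [[row[c] for row in t] for c in range(3)]
    # Orthogonality: pairwise dot products should vanish.
    ortho = dot(a1, a2) ** 2 + dot(a1, a3) ** 2 + dot(a2, a3) ** 2
    # Unit length: squared norms should equal 1.
    unit = sum((1.0 - dot(a, a)) ** 2 for a in (a1, a2, a3))
    return ortho + unit

def rigid_energy(transforms):
    """Assumed form of the rigid energy: sum of per-vertex deviations."""
    return sum(rigid_deviation(t) for t in transforms)
```

A rotation combined with any translation gives zero under this measure, while scaling or shearing is penalized.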

The energy terms described above are combined in the weighted sum

$$E_{shape} = \lambda_0 E_{NN} + \lambda_1 E_{reg} + \lambda_2 E_{rigid} + \lambda_3 E_{smooth}. \qquad (3)$$

The shape is fitted by minimizing $E_{shape}$ with respect to the parameters of $T_i$. We start
by encouraging smooth and rigid transformations by setting $\lambda_0 = 1$, $\lambda_1 = 5000$, $\lambda_2 = 1000$,
and $\lambda_3 = 100$. Similar to Li et al. [30], whenever the energy change is negligible, we relax
the weights as $\lambda_1^{i} = 0.5\lambda_1^{i-1}$, $\lambda_2^{i} = 0.5\lambda_2^{i-1}$, and $\lambda_3^{i} = 0.5\lambda_3^{i-1}$ to give more weight to the data
term. This allows the template to deform towards the scan. The algorithm iterates until
the relative change in energy $(E_{shape}^{i-1} - E_{shape}^{i}) / E_{shape}^{i-1}$, where $i$ is the iteration number, is
less than 0.0001. For each set of weights, we use a quasi-Newton approach [31] to solve the
optimization problem, and we perform at most 1000 iterations.
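
The weight relaxation and the stopping test can be illustrated with a short sketch. The function names are hypothetical and the quasi-Newton minimization itself is omitted; only the halving of the three regularizing weights and the relative energy change are shown.

```python
def relax(weights):
    """Halve the regularization, rigidity and smoothness weights; keep the data weight."""
    lam0, lam1, lam2, lam3 = weights
    return (lam0, 0.5 * lam1, 0.5 * lam2, 0.5 * lam3)

def relative_change(e_prev, e_cur):
    """Relative decrease in energy between two consecutive iterations."""
    return (e_prev - e_cur) / e_prev

# Initial weights from the text, followed by three relaxation steps.
weights = (1.0, 5000.0, 1000.0, 100.0)
schedule = [weights]
for _ in range(3):
    weights = relax(weights)
    schedule.append(weights)
```

After three relaxations the weights are (1.0, 625.0, 125.0, 12.5), so the data term gains influence geometrically relative to the other three terms.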



6 Experiments and results 
6.1 Database 

We use the BU-3DFE [32] database for our experiments. The database consists of 3D face
models from 100 subjects (56 females and 44 males) in neutral pose and with the following
facial expressions: surprise, happiness, disgust, sadness, anger, and fear. There are four scans
of each facial expression, corresponding to different levels of intensity from low to highest.
As a file containing the raw data of each scan is also available, there are a total of 50 files
per subject, 25 raw and 25 corresponding to the cropped faces. Fig. 11 shows snapshots of
different scans from the BU-3DFE database.





Figure 11: Characteristics of the BU-3DFE database. Panels: raw data; cropped neutral face;
facial expressions (anger, disgust, fear, sadness, surprise, happiness); happiness at four
intensity levels (low, middle, high, highest).



6.2 Landmark prediction accuracy 

We use two different subsets of models of 50 subjects (25 females and 25 males) to train
the landmark prediction model. First, we use a subset $T_n$ consisting of 50 models of the
subjects in neutral pose as training set. Second, we use a subset $T_g$ consisting of 350 models
of subjects in neutral pose and performing six different facial expressions as training set. As
$T_n$ covers the shape variability and $T_g$ covers both shape and expression variability, we are
able to evaluate the importance of the variabilities considered in the training sets.

To evaluate the accuracy of the landmark prediction algorithm, we compute the average
error, measured as the distance between a manually located landmark $l_i$ and its corresponding
estimate $\hat{l}_i$. We also compute the distribution of the relative error [33], $R_{err} = dist(l_i, \hat{l}_i)/dist_{ref}$.
Here, $dist(l_i, \hat{l}_i)$ is the Euclidean distance between $l_i$ and $\hat{l}_i$, and $dist_{ref}$ is the distance taken as
reference. The value of $dist_{ref}$ is chosen based on the face region that contains the analyzed
set of landmarks. To evaluate the accuracy of the location of the landmarks in the region of
the eyes, we compute the Euclidean distance between the manually labeled inner and outer
corners of the eyes for 80 models with neutral expression. The average of these distances is
used as $dist_{ref}$. For the nose region, the procedure is similar, but the distance is computed
between the landmarks located at the extremes of the nose base. The distances obtained are
31.973 mm and 27.087 mm for the eyes and nose regions, respectively.
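
The evaluation measures above can be sketched in a few lines. The function names are hypothetical; landmarks are plain coordinate tuples in millimeters, and the last helper reports the share of landmarks whose relative error stays below a threshold, which is one point on the relative error distribution.

```python
import math

def reference_distance(landmark_pairs):
    """Average Euclidean distance over pairs such as (inner corner, outer corner)."""
    return sum(math.dist(a, b) for a, b in landmark_pairs) / len(landmark_pairs)

def relative_error(manual, predicted, dist_ref):
    """Euclidean prediction error divided by the reference distance."""
    return math.dist(manual, predicted) / dist_ref

def fraction_below(rel_errors, threshold=1.0):
    """Share of relative errors strictly below the threshold."""
    return sum(1 for r in rel_errors if r < threshold) / len(rel_errors)
```

With the reported eye reference of 31.973 mm, a prediction that is 5 mm away from the manual landmark has a relative error of about 0.156.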

We evaluate the accuracy of the landmark prediction algorithm over the remaining 50 
subjects of the database (31 females and 19 males). The prediction is carried out over 350 
models of subjects in both neutral pose and when performing six different facial expressions. 

The landmarks were predicted with an error below $dist_{ref}$ in 87% and 95% of the cases
for the experiments with $T_n$ and $T_g$ as training databases, respectively. Figs. 12 and 13 show
the curves of the relative error distribution obtained using the different training sets. In both
experiments, the landmarks located in the nose region are better predicted than the ones
located in the eyes region. The average of the relative error curves (see Fig. 14) shows the
significant improvement in accuracy of the landmark prediction when $T_g$ is used as training
set. This indicates that, for the configuration of the landmark prediction model used in this
work, the variations due to both shape and expression have to be considered.




Figure 12: Relative error distribution using data set $T_n$. Left: Eye landmarks. Right: Nose
landmarks. Plot legend: right inner, right outer, left inner, left outer.

In addition to the relative error distribution, we compute the average, the standard
deviation and the maximum of the error when $T_g$ is used for training (see Table 1). The tip
of the nose is predicted with the lowest error and the outer corners of the eyes are predicted
with the highest error. One of the reasons that the outer corners of the eyes are not predicted
as well as the other landmarks is that the initial position is found based on the alignment of
the landmark template (see Fig. 4). This adds an estimation error that is reflected in the
high values of the standard deviation.
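
The per-landmark summary statistics can be computed as in the sketch below. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def error_summary(errors):
    """Return (average, standard deviation, maximum) of a list of errors in mm."""
    # Assumption: sample standard deviation; statistics.pstdev gives the population version.
    return statistics.mean(errors), statistics.stdev(errors), max(errors)
```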

Although the obtained landmark prediction error appears to be high, it is still possible to
obtain a proper point-to-point correspondence, since the landmarks only provide guidance
for the deformation algorithm. Fig. 15 shows some examples of the landmark prediction
results over models of subjects with different facial shapes and performing different
expressions. For all the registration experiments, for which results are shown in Section 6.3, we
used $T_g$ as training dataset.

Figure 13: Relative error distribution using data set $T_g$. Left: Eye landmarks. Right: Nose
landmarks.

Figure 14: Average of the relative error distribution. Left: Eye landmarks. Right: Nose
landmarks. Horizontal axis of both panels: Relative Error [mm].

Table 1: Error of landmark prediction with training set $T_g$.

6.3 Registration 

We tested our dense point-to-point correspondence algorithm on the models where the
landmarks were correctly predicted (332 models). To generate the blendshape model, we use the
blendshapes shown in Fig. 16. Notice that mostly mouth displacements are considered. As
the expressions are generated as a linear combination of displacements, it is important that
two blendshapes do not add the same kind of displacement, in order to avoid exaggerated,
undesired expressions. The third column of Fig. 17 shows examples of the expression fitting
results for six different kinds of facial expression. In all cases, the expression of the mouth
region of the input model is properly matched after linear blending.

Next, we discuss the quality of the results after the final shape fitting step. The fourth
column of Fig. 17 shows examples of the shape fitting results. The models are color-coded
with respect to the signed distance from the input scan. Note that most points on the models
are within 1 mm of the scan and that the results are visually pleasing. Furthermore, notice
how the different expressions in the eyebrows are properly fitted. To visualize the
quality of the correspondences, a chess-board texture was applied to the template model (see
right of Fig. 17). Results of the texture transfer show that in most of the face regions, the
shape of the deformed template matches the shape of the input model.

We also ran tests to verify whether the level of the expression affects the quality of the fitting.
Fig. 18 depicts how the proposed method is able to correctly fit the template to different
levels of expressions. For each example, the input, output and textured models are provided.
Notice how both slight and pronounced movements of the eyebrows and mouth are properly
matched.

Figure 15: Examples of the landmark prediction results. Red and green spheres correspond
to the manually placed and predicted landmarks, respectively. First row: female subjects;
Second row: male subjects.

Figure 16: Shapes used to generate the blendshape model.

Figure 17: Examples of registration results. The input, fitted expression,
texture mapped models are provided for each example.

Most of the incorrect shape fitting occurs on the inner parts of the lips. As the input
scans have information in the area of the teeth, which is not considered in the template
model, the algorithm converges to this region, thereby causing miscorrespondences during
the shape fitting. Fig. 19 shows an example of the limitations in the shape fitting. Notice
how the expression is matched correctly, but the corners of the mouth are not well located,
which causes an incorrect fitting on the mouth and chin regions (first row of Fig. 19). The
situation becomes critical when the expression is incorrectly matched (second row of Fig.
19). For our experiments, we obtained visually pleasing shape fitting results in 294 (84%) of
the tested models.

Additional tests were performed over models with occluded parts. In this case, the
template was correctly fitted when the occlusion did not occur in the locations of the landmarks
used for the initial alignment. Fig. 20 shows the result of the proposed point-to-point
correspondence approach for a model of a subject where the mouth is occluded by a hand. Note
that a visually pleasing result is obtained.



7 Conclusions 

This paper presented a fully automatic method to compute dense point-to-point correspon- 
dences between a set of human face scans with varying expressions. The proposed approach 
proceeds by learning local shape descriptors and spatial relationships for a set of landmark 
points. For a new scan, the approach first predicts the landmark points by performing sta- 
tistical inference on the learned model. The approach then fits a template to the scan in two 
stages. The first stage fits the expression of the template to the expression of the scan using 
the predicted landmark points. The second stage fits the shape of the template to the shape 
of the scan using a non-rigid iterative closest point technique. We applied our approach 
to 350 models of the BU-3DFE database, and evaluated the results both qualitatively and 
quantitatively. We showed that for 95% of the models, the landmarks are predicted with an 
acceptable error, and that for 84% of the models, a visually pleasing correspondence is found. 
Furthermore, we evaluated the algorithm on a challenging case of a face with occlusion. 

The failure cases of the algorithm are mostly caused by noisy data in the mouth area. 
For future work we plan to design algorithms that can handle this challenging scenario. We 
will also test the algorithm on a large database of models with different types of occlusion, 
such as models wearing eyeglasses.

Figure 18: Results of fitting to models of the same subject performing an expression at
different levels: fear (first three rows) and surprise (last three rows). For each example, the
first, second, and third rows are the input, output, and textured models, respectively.

Figure 19: Incorrect shape fitting. The differences in topology of the input and template
meshes cause incorrect expression and shape fitting.




Figure 20: Challenging test scenario. Mapped error models correspond to the fitting result.
The test was carried out over one model of the Bosphorus database [34]. Panel labels:
manually placed landmarks; predicted landmarks.